Big Data Introduction
- What is Big Data and why Big Data?
- Four Vs of Big Data
- Scaling problems with existing systems and how Hadoop resolves them
- What are MapReduce and HDFS
- Different Hadoop vendors in the industry
Unix
- Unix Concepts
- Introduction to Unix
- Basic Unix Commands
- How to write a shell script
HDFS & its Architecture
- Distributed Computing: NameNode and DataNode concepts
- HDFS Introduction and Architecture
- What are blocks in HDFS and how they make Hadoop fault tolerant
- What is the Secondary NameNode
- What is checkpointing in Hadoop 1.0
- Differences between Hadoop 1.0 and Hadoop 2.0
- HDFS configuration files and how to change the block size on a cluster
- Hadoop File System Commands
- Assignment on HDFS
MapReduce and Its Architecture
- Different phases of MapReduce and Execution Flow
- What is an Input Split in MR
- Word Count problem in MR
- Joining problem in MR
- How to develop and submit MR code on a Hadoop cluster
- Assignment on MapReduce
YARN
- Why YARN
- Components of YARN & Architecture
- How the ResourceManager functions
- NodeManager responsibilities
- How the ApplicationMaster works
- Different Schedulers in YARN
Sqoop
- What is Sqoop and why it is used
- Import Data from RDBMS to HDFS
- Full vs Incremental Data Import
- Different File formats to Import Data
- Various methods to Import Data
- Performance Tuning
- Sqoop Jobs
- Automate Sqoop using Shell Script
- Sqoop Export from Hadoop to RDBMS
Hive / Impala
- Hive Introduction
- Datatypes in Hive
- Architecture of Hive
- How to create databases and tables using different file formats
- Different ways to load data into Hive tables
- Views in Hive
- External vs Internal Tables
- Partitioning vs bucketing
- Static vs Dynamic Partitioning
- Joins in Hive
- Map side joins in Hive
- Analytical functions in Hive
- Performance tuning
- Hive shell vs Beeline Shell
- Hive Execution Engines: MapReduce, Tez, Spark
- What is Impala and how it differs from Hive
- Assignment
Scala
- Scala Basics
- Variables, Strings, and Numbers
- Arrays, Lists, Tuples, Maps
- For loops, if-else, and match expressions
- Functions, Objects, and Classes
- What is a case class in Scala
- The Scala REPL
- How to write and run a Scala program in an IDE
- Assignment
Spark
- Introduction to Spark
- What are RDDs?
- How to Create RDDs
- Transformations in RDD
- Actions in RDD
- Lazy evaluation in Spark
- Lineage Graph in RDD
- What are pair RDDs and when are they used
- What are DataFrames in Spark and how they differ from RDDs
- How to create DataFrames
- How to load data from an RDBMS into Hadoop using Spark
- How to perform transformations using DataFrame API
- What is a broadcast join in Spark
- Cache vs Persist in Spark
- Performance tuning in Spark
- What are Datasets in Spark and how they differ from the DataFrame API
- Assignment on Spark